Summary 
In many conventional scientific investigations with high or ultra-high dimensional 
feature spaces, the relevant features, though sparse, are large in number compared 
with classical statistical problems, and the magnitude of their effects tapers off. It 
is reasonable to model the number of relevant features as a diverging sequence when 
sample size increases. In this article, we investigate the properties of the extended 
Bayes information criterion (EBIC) (Chen and Chen, 2008) for feature selection in lin- 
ear regression models with diverging number of relevant features in high or ultra-high 
dimensional feature spaces. The selection consistency of the EBIC in this situation is 
established. The application of EBIC to feature selection is considered in a two-stage 
feature selection procedure. Simulation studies are conducted to demonstrate the 
performance of the EBIC together with the two-stage feature selection procedure in 
finite sample cases. 

Keywords: Diverging number of parameters, Feature selection, Extended Bayes information 
criterion, High dimensional feature space, Penalized likelihood, Selection consistency. 



1 Introduction 

In the setting of a regression model, if the number of features (covariates) p is of 
the polynomial order or exponential order of the sample size n, i.e., $p = O(n^{\kappa})$ or 
$p = O(\exp(n^{\kappa}))$, the feature space is referred to as a high-dimensional or ultra-high 
dimensional feature space. Regression problems with high or ultra-high dimensional 
feature spaces arise in many important fields of scientific research such as genomics 
study, medical study, risk management, machine learning, etc. Such problems are 
generally referred to as small-n-large-p problems. In many small-n-large-p problems 
the relevant (or causal, true, as referred by some other authors) features, though 
sparse, are relatively large in number compared with classical statistical problems, 
and their effects usually taper off to zero from the largest to the smallest. To reflect 
the estimability of the feature effects, it is reasonable to model the number of relevant 
features as a diverging sequence depending on the sample size. [7] and [11] are among 
the earliest papers dealing with diverging number of relevant features. In this article, 
we consider model selection criteria for linear regression models with high or ultra- 
high feature space and diverging number of relevant features. 

In general, there are two goals in model selection. The first one is to select 
models to do prediction and the focus is on prediction accuracy. The second one is 
to identify relevant features and the focus is on selection consistency. In traditional 
model selection problems where the number of features under study is small, these two 
goals might be addressed at the same time. But, in small-n-large-p problems, the two 
goals need to be treated separately. We concentrate on the second goal in this article 
and refer to the problem as feature selection. A model selection criterion is crucial for 
feature selection. The traditional model selection criteria such as Akaike's information 
criterion (AIC) [1], cross-validation (CV) [16], generalized cross-validation (GCV) [6] 
and the Bayes information criterion (BIC) [H] are not suitable for feature selection 
in small-n-large-p problems. The CV and GCV, which aim to minimize prediction 
errors, do not address the issue of selection consistency. The AIC and BIC are 
overly liberal; that is, they select far more features than the relevant ones, see 
[31 [IHl E]. [2] proposed a modified BIC (mBIC) for the study of genetic QTL mapping 
to address problems caused by too many features. [1] developed a family of extended 
Bayes information criteria (EBIC) for feature selection in small-n-large-p problems. 
The family of EBIC is indexed by a parameter $\gamma$ in the range [0, 1]. The original BIC 
is a special case of EBIC with $\gamma = 0$. The mBIC is also a special case of EBIC in an 
asymptotic sense; that is, it is asymptotically equivalent to the EBIC with $\gamma = 1$. [1] 
considered the case of high dimensional feature space with fixed number of relevant 
features. They established the selection consistency of EBIC when $p = O(n^{\kappa})$ and 
$\gamma > 1 - \frac{1}{2\kappa}$ for any $\kappa > 0$. 

Model selection criteria for a diverging number of relevant features in high or 
ultra-high dimensional feature spaces remain largely unexplored. [19] considered a BIC 
type criterion for diverging number of relevant features but their criterion applies 
only when the dimension of the feature space is smaller than n; in fact, they require 
$p/n^{\kappa} < 1$ for some $0 < \kappa < 1$. In this paper, we investigate the property of the 
EBIC when the number of relevant features diverges at the order $O(n^{c})$ for some 
$0 < c < 1$ and $p = O(n^{\kappa})$ for any $\kappa > 0$ or $p = O(\exp(n^{\kappa}))$ for some $0 < \kappa < 1$. 
We identify the conditions under which the EBIC remains selection consistent and 
provide the theoretical proof (Theorem 1). Since the seminal paper on LASSO [17], 
penalized likelihood methods with various penalty functions have been widely used for 
model selection, see, e.g., [H [H [22]. It has been shown that if the penalty parameter 
in the penalized likelihood is properly chosen the penalized likelihood methods are 



selection consistent under certain conditions, see [HI [HI U3l El [ISl [12]. However, in 
practice, without a proper criterion for the selection of the penalty parameter (which 
corresponds to the selection of model), the selection consistency cannot be realized. 
The commonly used criterion in the penalized likelihood methods, the CV, cannot be 
selection consistent in small-n-large-p problems, as we have already pointed out in the 
previous paragraph. In this paper, we also consider the application of the EBIC for 
the selection of the penalty parameter in penalized likelihood methods. Simulation 
studies are conducted to demonstrate the finite sample properties of the EBIC and 
the selection procedures. 

The remainder of the paper is arranged as follows. In §2, the selection consistency 
of EBIC with diverging number of relevant features is established. In §3, a two- 
stage feature selection procedure with the application of the EBIC is described and 
discussed. In §4, simulation results are reported. Technical details and proofs are 
provided in the Appendix. 

2 Selection consistency of EBIC with diverging 
number of relevant features 

We denote by $p_n$ the number of features under investigation to make its dependence 
on n explicit. Let $(y_i, x_{i1}, \ldots, x_{ip_n})$, $i = 1, \ldots, n$, be independent observations. We 
consider the following linear model 

$$y_i = \sum_{j=1}^{p_n} \beta_{nj} x_{ij} + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where the $\epsilon_i$'s are i.i.d. with mean zero and variance $\sigma^2$. In matrix notation, (1) is 
expressed as 

$$y_n = X_n \beta_n + \epsilon_n,$$

where $\beta_n = (\beta_{n1}, \ldots, \beta_{np_n})^T$, $y_n = (y_1, \ldots, y_n)^T$, $\epsilon_n = (\epsilon_1, \ldots, \epsilon_n)^T$ and 
$X_n = (x_{ij})_{i=1,\ldots,n;\ j=1,\ldots,p_n}$. Here $p_n$ is 
either of a polynomial order or an exponential order of n, and $\beta_n$ is sparse, meaning 
that only a few of its components are non-zero. 

We first introduce some notation. Let $s_{0n} = \{j : \beta_{nj} \ne 0,\ j \in \{1, \ldots, p_n\}\}$. Let 
$s$ be any subset of $\{1, \ldots, p_n\}$. For convenience, we also refer to $s$ as a submodel. 
We denote by $X_n(s)$ the matrix composed of the columns of $X_n$ with indices in $s$. 
Similarly, $\beta_n(s)$ denotes the vector consisting of components of $\beta_n$ with indices in $s$. 
Let $\nu(s)$ denote the number of components in $s$. Let $p_{0n} = \nu(s_{0n})$. Let $H_n(s)$ be the 
projection matrix of $X_n(s)$, i.e., $H_n(s) = X_n(s)[X_n(s)^T X_n(s)]^{-1} X_n(s)^T$. Define 

$$\Delta_n(s) = \|\mu_n - H_n(s)\mu_n\|_2^2,$$

where $\mu_n = E y_n = X_n(s_{0n})\beta_n(s_{0n})$ and $\|\cdot\|_2$ is the $L_2$ norm. 

Let $S_j$ be the set of all combinations of $j$ indices in $\{1, \ldots, p_n\}$. Interchangeably 
we also call $S_j$ the class of submodels consisting of $j$ features. Let $\tau(S_j)$ be the size 
of $S_j$; that is, $\tau(S_j) = \binom{p_n}{j}$. 

The family of EBIC proposed in [1] under model (1) is defined as 

$$\mathrm{EBIC}_\gamma(s) = n \ln\left(\frac{\|y_n - H_n(s) y_n\|_2^2}{n}\right) + \nu(s) \ln n + 2\gamma \ln \tau(S_j), \quad s \in S_j,\ \gamma \ge 0.$$

The family of EBIC is motivated from a Bayesian framework which gives rise to 
the BIC. The BIC of a model $s$ is an approximation to minus twice the logarithm 
of the posterior probability of $s$ when the prior probability on each model is equal. 
With equal prior probabilities, the prior probability on the submodel class $S_j$ is 
proportional to its size $\tau(S_j)$. This makes BIC favor models with a larger number of 
features in small-n-large-p problems. Instead of imposing an equal prior probability 
on each model, the EBIC imposes different prior probabilities on models in different 
submodel classes such that the prior probability on $S_j$ is proportional to $\tau(S_j)^{1-\gamma}$. 
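
The definition above can be evaluated directly. The following is a minimal sketch, assuming a Gaussian linear model without intercept and a least-squares refit on the submodel; the function names `log_binom` and `ebic` and the simulated data are illustrative assumptions, not part of the original article.

```python
import math
import numpy as np

def log_binom(p, j):
    """ln of the binomial coefficient C(p, j), i.e. ln tau(S_j)."""
    return math.lgamma(p + 1) - math.lgamma(j + 1) - math.lgamma(p - j + 1)

def ebic(y, X, support, gamma):
    """n ln(RSS(s)/n) + nu(s) ln n + 2 gamma ln tau(S_{nu(s)}) for submodel `support`."""
    n, p = X.shape
    support = list(support)
    j = len(support)
    if j == 0:
        rss = float(y @ y)  # empty submodel: nothing is projected out
    else:
        Xs = X[:, support]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)  # least-squares refit on s
        resid = y - Xs @ coef
        rss = float(resid @ resid)
    return n * math.log(rss / n) + j * math.log(n) + 2.0 * gamma * log_binom(p, j)

# Hypothetical toy data: 3 relevant features among p = 500.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)

true_support = [0, 1, 2]
bic_true = ebic(y, X, true_support, gamma=0.0)   # gamma = 0 gives the ordinary BIC
ebic_true = ebic(y, X, true_support, gamma=1.0)  # gamma = 1 adds 2 ln C(p, 3)
```

The two values differ only by the extra term $2\gamma \ln \tau(S_j)$, which is what penalizes large submodel classes when $p_n$ is much larger than $n$.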



The parameter $\gamma$ is determined such that the resultant EBIC is selection consistent. 
In the case of high dimensional feature space, i.e., $p_n = O(n^{\kappa})$ for any $\kappa > 0$, and 
a fixed number of relevant features, [1] showed that if $\gamma > 1 - 1/(2\kappa)$ the EBIC 
is selection consistent. In the following, we deal with the case that the number of 
relevant features diverges and the feature space is high or ultra-high dimensional. 
First we consider the following condition: 
Consistency Condition: 

$$\lim_{n \to \infty} \min\left\{ \frac{\Delta_n(s)}{p_{0n} \ln p_n} : s_{0n} \not\subset s,\ \nu(s) \le k_n \right\} = \infty,$$

where $k_n = k p_{0n}$ for any fixed $k > 1$. 

This condition is slightly different from what is called the asymptotic identifiability 
condition in [1]. The restriction $\nu(s) \le k_n$ is imposed because in practice only 
models with size comparable with or smaller than the true model will be considered. 
Implicitly, the consistency condition requires that 

$$\sqrt{\frac{n}{p_{0n} \ln p_n}}\ \min\{|\beta_{nj}| : j \in s_{0n}\} \to \infty. \qquad (2)$$

We now discuss a relationship between the consistency condition above and the well 
known sparse Riesz condition, which is given as follows: 

$$0 < c_{\min} \le \min\left\{\lambda_{\min}\left(\tfrac{1}{n} X_n(s)^T X_n(s)\right) : \nu(s) \le k_n\right\} \le \max\left\{\lambda_{\max}\left(\tfrac{1}{n} X_n(s)^T X_n(s)\right) : \nu(s) \le k_n\right\} \le c_{\max} < \infty,$$

where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and the largest eigenvalues respectively. If 
$p_{0n}$ is fixed and hence so is $\{\beta_{nj} : j \in s_{0n}\}$ then the sparse Riesz condition implies the 
consistency condition as shown in [1]. If $p_{0n}$ diverges then the sparse Riesz condition 
together with (2) implies the consistency condition. When the number of relevant features 
diverges, conditions of the type (2) are always imposed for selection consistency 
in penalized likelihood procedures, see [231 CSl [12]. As the following proposition implies, 
the sparse Riesz condition together with (2) is a stronger assumption than the 
consistency condition. 

Proposition 1. Assume $s_{0n} = \{1, 2, \ldots, p_{0n}\}$. Let $s^{k}$ be the set with the $k$th element 
of $s_{0n}$ removed. Let $k(s) = s^{k} \cup s$. If (2) is satisfied and 

y . maXfc{||[/-g„(fc(g))]X„({fc})||} 

hm mm^:^(^)<fc„,,o5z^, = oo (3) 

then the consistency condition holds. 

The above proposition is similar to a result in [4] which deals with a high dimensional 
feature space and a fixed number of relevant features. As in [1], examples 
can be constructed such that (3) holds but the sparse Riesz condition does not hold. 

Condition (2) determines the divergence pattern of $(n, p_{0n}, p_n)$ and the constraint 
on $\beta_{nj}$. Now consider the high and ultra-high dimensional feature spaces separately. 
If $p_n = O(n^{\kappa})$ for any fixed $\kappa > 0$ and $p_{0n} = n^{c}$ for some $0 < c < \kappa$, (2) reduces to 
$\sqrt{n^{1-c}/\ln n}\ \min\{|\beta_{nj}| : j \in s_{0n}\} \to \infty$. The induced constraint on $\beta_{nj}$ is that 
$\min\{|\beta_{nj}| : j \in s_{0n}\}$ must have a magnitude larger than $O(n^{-(1-c)/2})$. Let $b$ be any number bigger than 
$c$. Then the following provides a consistency pattern: $(n, p_{0n}, p_n) = (n, O(n^{c}), O(n^{\kappa}))$, 
$\min\{|\beta_{nj}| : j \in s_{0n}\} = O(n^{-(1-b)/2})$, $0 < c < \kappa$, $c < b < 1$. If $p_n = O(\exp(n^{\kappa}))$ 
and $p_{0n} = n^{c}$ then, by the same argument, (2) induces the following consistency 
pattern: $(n, p_{0n}, p_n) = (n, O(n^{c}), O(\exp(n^{\kappa})))$, $\min\{|\beta_{nj}| : j \in s_{0n}\} = O(n^{-(1-b)/2})$, 
$0 < c$, $\kappa < 1$, $c + \kappa < b < 1$. 
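
The ultra-high dimensional pattern can be checked by direct substitution. The short derivation below is a sketch under simplifying assumptions that are not in the original text: condition (2) is read as $\sqrt{n/(p_{0n}\ln p_n)}\,\min|\beta_{nj}| \to \infty$, the rates are taken to hold exactly ($p_{0n} = n^{c}$, $\ln p_n = n^{\kappa}$, $\min|\beta_{nj}| = n^{-(1-b)/2}$), and constants are suppressed.

```latex
\[
\sqrt{\frac{n}{p_{0n}\ln p_n}}\;\min_{j \in s_{0n}}|\beta_{nj}|
  = n^{(1-c-\kappa)/2}\, n^{-(1-b)/2}
  = n^{(b-c-\kappa)/2}
  \;\longrightarrow\; \infty
  \quad\Longleftrightarrow\quad b > c + \kappa .
\]
```

Under these assumptions the requirement $c + \kappa < b < 1$ is exactly what makes the left-hand side diverge.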

We now state the main result on the selection consistency of the EBIC with 
diverging number of relevant features in high or ultra-high dimensional feature spaces. 

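Theorem 1. Assume model (1) and the consistency condition. In addition, assume 
that $p_{0n} \ln p_n = o(n)$ and $\ln p_{0n}/\ln p_n \to \delta \ge 0$. Let $k_n = k p_{0n}$ for any constant $k > 1$. 
Then 

$$P\left\{\min_{s:\ \nu(s) \le k_n,\ s \ne s_{0n}} \mathrm{EBIC}_\gamma(s) > \mathrm{EBIC}_\gamma(s_{0n})\right\} \to 1,$$

if $\gamma > \frac{1+\delta}{1-\delta} - \frac{\ln n}{2(1-\delta)\ln p_n}$. 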

The following are immediate corollaries of Theorem 1. 

Corollary 1. If $p_n = O(n^{\kappa})$ for any constant $\kappa > 0$ and $p_{0n} = p_0$ is fixed, the EBIC is 
selection consistent with $\gamma > 1 - \frac{\ln n}{2 \ln p_n} = 1 - \frac{1}{2\kappa}$ among all models $s$ with $\nu(s) \le k_n$. 

Corollary 2. If $p_n = O(n^{\kappa})$ for any constant $\kappa > 0$, $p_{0n} = O(n^{c})$, $\min\{|\beta_{nj}| : j \in 
s_{0n}\} = O(n^{-(1-b)/2})$, $0 < c < \kappa$, $c < b < 1$, then the EBIC is selection consistent with 
$\gamma > \frac{\kappa + c - 0.5}{\kappa - c}$ among all models $s$ with $\nu(s) \le k_n$. 

Corollary 3. If $p_n = O(\exp(n^{\kappa}))$ for $0 < \kappa < 1$, $p_{0n} = O(n^{c})$, $\min\{|\beta_{nj}| : j \in s_{0n}\} = 
O(n^{-(1-b)/2})$, $0 < c$, $\kappa < 1$, $c + \kappa < b < 1$, the EBIC is selection consistent with 
$\gamma > 1 - \frac{\ln n}{2 \ln p_n}$ among all models $s$ with $\nu(s) \le k_n$. 
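
To see how the threshold in Theorem 1 specializes to the corollaries, the following sketch evaluates $\frac{1+\delta}{1-\delta} - \frac{\ln n}{2(1-\delta)\ln p_n}$ numerically. The function name `gamma_lower_bound` and the chosen values of $n$, $\kappa$ and $c$ are illustrative assumptions; the rates are taken to hold exactly ($p_n = n^{\kappa}$, $\delta = c/\kappa$).

```python
import math

def gamma_lower_bound(n, p_n, delta):
    """Threshold (1 + delta)/(1 - delta) - ln n / (2 (1 - delta) ln p_n)."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    return (1.0 + delta) / (1.0 - delta) - math.log(n) / (2.0 * (1.0 - delta) * math.log(p_n))

n, kappa = 1000, 2.0

# Fixed number of relevant features: delta = 0, threshold 1 - 1/(2 kappa).
bound_fixed = gamma_lower_bound(n, n ** kappa, delta=0.0)

# Diverging number of relevant features with p_0n = n^c: delta = c / kappa.
c = 0.5
bound_diverging = gamma_lower_bound(n, n ** kappa, delta=c / kappa)
closed_form = (kappa + c - 0.5) / (kappa - c)  # expression in Corollary 2
```

With these inputs `bound_fixed` agrees with $1 - 1/(2\kappa)$ and `bound_diverging` agrees with `closed_form`, so the two corollaries are the same formula evaluated at different $\delta$.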

The following lemmas are needed in the proof of Theorem 1. 

Lemma 1. If $\frac{\ln j}{\ln p} \to \delta$ as $p \to +\infty$, we have 

$$\ln\left(\frac{p!}{j!(p-j)!}\right) = j \ln p\,(1-\delta)(1 + o(1)).$$

Lemma 2. Let $\chi_k^2$ denote a $\chi^2$ random variable with degrees of freedom $k$. If 
$m \to +\infty$ and $\frac{K}{m} \to 0$ then 

$$P(\chi_k^2 > m) = \frac{1}{\Gamma(k/2)} (m/2)^{k/2-1} e^{-m/2} (1 + o(1)),$$

uniformly for all $k \le K$. 
The proofs of Lemmas 1 and 2 and Theorem 1 are given in the Appendix. 
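
As a numerical illustration of Lemma 2, the sketch below compares the exact tail probability with the leading term $\frac{1}{\Gamma(k/2)}(m/2)^{k/2-1}e^{-m/2}$. It is restricted to even $k$, where the tail has the closed form $e^{-m/2}\sum_{i<k/2}(m/2)^i/i!$; the function names are illustrative assumptions.

```python
import math

def chi2_tail_even(k, m):
    """Exact P(chi^2_k > m) for even k: e^{-m/2} * sum_{i < k/2} (m/2)^i / i!."""
    if k <= 0 or k % 2:
        raise ValueError("k must be a positive even integer")
    half = m / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k // 2):
        term *= half / i      # (m/2)^i / i!
        total += term
    return math.exp(-half) * total

def lemma2_leading_term(k, m):
    """Leading term (m/2)^{k/2-1} e^{-m/2} / Gamma(k/2), computed on the log scale."""
    half = m / 2.0
    return math.exp((k / 2.0 - 1.0) * math.log(half) - half - math.lgamma(k / 2.0))

# Ratio exact / leading term for k = 4; it equals 1 + 2/m and tends to 1.
ratios = {m: chi2_tail_even(4, m) / lemma2_leading_term(4, m) for m in (20, 200, 1000)}
```

For $k = 4$ the ratio is $1 + 2/m$, so the relative error of the approximation shrinks as $m$ grows with $k$ held small relative to $m$.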






3 Application of EBIC in feature selection procedures 

In this section, we consider the application of EBIC for choosing tuning parameters in 
feature selection procedures using penalized likelihood methods. When the dimension 
of the feature space is high or ultra-high, a natural first step in feature selection is to 
reduce the dimensionality of the feature space by some screening procedure and then 
to apply the penalized likelihood method with the reduced feature space. This has 
become a well-accepted strategy for feature selection with high or ultra-high feature 
space, see, e.g., [101 1201 Ej. In the following, we describe a general feature selection 
procedure of this nature where EBIC is used to choose the penalty parameter in the 
penalized likelihood. 

Screening stage: Let $\mathcal{F}_n$ denote the set of all the features. This stage screens out 
obviously irrelevant features by a screening procedure and reduces $\mathcal{F}_n$ to a set 
$S^*$ with dimension smaller than n. The screening procedure we recommend is 
as follows. First use the sure independence screening (SIS) advocated in [10] 
to reduce the dimension of $\mathcal{F}_n$ to a low power order of n, then use 
LASSO with an appropriate penalty parameter to further reduce the 
dimension below n. 

Selection stage: Select features by optimizing a penalized log likelihood of the form 

$$l_{n,\lambda}(X(S^*), \beta(S^*)) = -2 \ln L(X(S^*), \beta(S^*)) + \sum_{j \in S^*} p_\lambda(|\beta_j|),$$

where $L(X(S^*), \beta(S^*))$ is the likelihood function of the model with all features 
in $S^*$, $p_\lambda(\cdot)$ is a penalty function and $\lambda$ is the penalty parameter. An appropriate 
penalty function to use is the SCAD penalty proposed in [9]. The $\lambda$ is chosen 
by EBIC as follows. For each $\lambda$, let $s_{n\lambda}$ be the set of features with non-zero 
coefficients when $l_{n,\lambda}(X(S^*), \beta(S^*))$ is minimized. Compute 

$$\mathrm{EBIC}_\gamma(\lambda) = -2 \ln L(X(s_{n\lambda}), \hat\beta(s_{n\lambda})) + \nu(s_{n\lambda}) \ln n + 2\gamma \ln \binom{p_n}{\nu(s_{n\lambda})},$$

where $\hat\beta(s_{n\lambda})$ is the maximum likelihood estimate (without penalty) of $\beta(s_{n\lambda})$ 
and $\gamma$ is taken to be $1 - \frac{\ln n}{C \ln p_n}$ for some $C > 2$. Let $\lambda^*$ be the one which attains 
the minimum $\mathrm{EBIC}_\gamma$. The set $s_{n\lambda^*}$ is taken as the set of selected features. 
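
The selection step can be sketched as follows. This is not the SIS-LASSO-SCAD pipeline itself: as labeled stand-ins, screening is done by ranking absolute marginal inner products, and the candidate supports are nested top-$k$ sets rather than the supports along a SCAD path. For a Gaussian model, $-2\ln L$ at the unpenalized estimate equals $n\ln(\mathrm{RSS}/n)$ up to a constant, which is what the code uses. All function names and the simulated data are assumptions for illustration.

```python
import math
import numpy as np

def log_binom(p, j):
    """ln C(p, j)."""
    return math.lgamma(p + 1) - math.lgamma(j + 1) - math.lgamma(p - j + 1)

def ebic_gaussian(y, X, support, gamma, p_total):
    """n ln(RSS/n) + nu ln n + 2 gamma ln C(p_total, nu), with an unpenalized refit."""
    n, j = len(y), len(support)
    if j:
        Xs = X[:, support]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ coef) ** 2))
    else:
        rss = float(y @ y)
    return n * math.log(rss / n) + j * math.log(n) + 2.0 * gamma * log_binom(p_total, j)

def select_by_ebic(y, X, candidate_supports, gamma):
    """Return the candidate support with the smallest EBIC, plus all scores."""
    p_total = X.shape[1]
    scores = [ebic_gaussian(y, X, s, gamma, p_total) for s in candidate_supports]
    return candidate_supports[int(np.argmin(scores))], scores

# Hypothetical data: 4 relevant features among p = 1000.
rng = np.random.default_rng(1)
n, p = 200, 1000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 1, 2, 3]] = [3.0, -3.0, 2.5, -2.5]
y = X @ beta + rng.standard_normal(n)

# Stand-in for the screening stage: keep the n/2 features with largest |x_j' y|.
order = np.argsort(-np.abs(X.T @ y))
screened = order[: n // 2]

# Stand-in for the supports s_{n,lambda} along a penalty path: nested top-k sets.
candidates = [sorted(screened[:k].tolist()) for k in range(21)]

gamma = 1.0 - math.log(n) / (4.0 * math.log(p))  # gamma = 1 - ln n / (C ln p_n), C = 4
selected, scores = select_by_ebic(y, X, candidates, gamma)
```

In practice the candidate list would be replaced by the distinct supports produced by the penalized likelihood fit over a grid of $\lambda$ values; only the scoring and the arg-min step carry over unchanged.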

We briefly discuss the properties of the above feature selection procedure in the 
following. For a screening procedure, if $P(s_{0n} \subset S^*) \to 1$ as n goes to infinity, the 
screening procedure is said to have the property of sure screening, see [10]. For a penalized 
likelihood function of the above type, if there is $\lambda_n$ such that $P(s_{n\lambda_n} = s_{0n}) \to 1$, 
the penalized likelihood is said to have an oracle property (in a narrower sense). If 
the screening procedure in the screening stage has the property of sure screening, the 
reduced feature space $S^*$ will contain all the relevant features with probability 
converging to 1 as n goes to infinity. If the penalized likelihood has the oracle property 
with the reduced feature space, there will be a $\lambda$ value such that its corresponding set 
$s_{n\lambda}$ is the same as $s_{0n}$, the true set of relevant features, in the selection stage when 
$S^*$ contains all the relevant features. Then the selection consistency of EBIC will 
guarantee that the true set of relevant features is selected. Thus the feature selection 
procedure will be selection consistent if the conditions required by the sure screening 
property of the screening procedure, the oracle property of the penalized likelihood 
and the selection consistency of EBIC are met simultaneously. 

Fan and Lv [10] showed that, under certain conditions (conditions 1-4 in section 
5 of their paper), the SIS has the sure screening property if the dimension of the 
feature space is reduced to an order $O(n^{1-\xi})$ for some $\xi > 0$. If the tuning parameter in 
LASSO is chosen such that the number of non-zero coefficients is large enough (smaller 
than n), the LASSO procedure can retain all the true features almost surely as n goes 
to infinity, see [S]. Kim et al. [12] considered the SCAD with diverging number of 
relevant features under the following conditions. C1: There are $0 < c < b < 1$ and 
$M_1 > 0$ such that $p_{0n} = O(n^{c})$ and $n^{(1-b)/2} \min\{|\beta_{nj}| : j \in s_{0n}\} \ge M_1$. C2: There 
exists $M_2 > 0$ such that $n^{-1} X_n(\{j\})^T X_n(\{j\}) \le M_2$ for any $j$. C3: There exists 
$M_3 > 0$ such that $\lambda_{\min}(n^{-1} X_n(s_{0n})^T X_n(s_{0n})) \ge M_3$, where $\lambda_{\min}$ denotes the smallest 
eigenvalue. C4: $p_n \le n$ and the eigenvalues of $n^{-1} X_n^T X_n$ are uniformly bounded 
from both below and above. They showed that under the above conditions the oracle 
property of the SCAD holds. The condition $n^{(1-b)/2} \min\{|\beta_{nj}| : j \in s_{0n}\} \ge M_1$ implies 
that $n^{(1-c-\kappa)/2} \min\{|\beta_{nj}| : j \in s_{0n}\} \to \infty$ for some $\kappa < b - c$. If C4 is replaced by C4': 
$\nu(S^*) < n$ and the eigenvalues of $n^{-1} X_n(s_n)^T X_n(s_n)$ for any $s_n \subset S^*$ are uniformly 
bounded from both below and above, then together with C1-C3 the oracle property of 
the SCAD penalized likelihood in the selection stage will be guaranteed. Therefore, 
if conditions 1-4 in [10], C1-C3, C4' and the consistency condition hold, 
then the two-stage procedure described above is selection consistent. The reason we 
recommend a two-step screening procedure is that if only SIS is used to reduce the 
dimensionality below n, condition C4' might not hold because SIS does not reduce the 
level of the spurious correlations in the original feature space. On the other hand, 
LASSO does reduce the level of the spurious correlations since it tends to select only 
one of the highly correlated features, see [21], but due to the capacity of the computing 
facilities it might not be able to handle ultra-high dimensional feature space. When 
the two steps are combined it is more likely that C4' will be satisfied while the sure 
screening property is retained. In fact, the conditions in [10] for the sure screening 
property can be much relaxed when the dimensionality is only reduced to a power 
order of n higher than n. The performance of the feature selection procedure described 
in this section is investigated in simulation studies which are presented in the next 
section. 

4 Simulation studies 

The purpose of the simulation studies is to investigate the applicability of EBIC 
in feature selection procedures and to investigate whether or not the asymptotic 
property of selection consistency can be realized in finite sample situations. To this 
end, the two-stage feature selection procedure discussed in §3 is considered in the 
simulation studies. The R package plus [21] is used for the computation. We are 
mainly concerned about the selection consistency of the EBIC in the consistent range 
of $\gamma$. We take $\gamma$ slightly bigger than $1 - \frac{\ln n}{2 \ln p_n}$ (in the simulation we take $\gamma = 1 - \frac{\ln n}{4 \ln p_n}$) 
for demonstrating the performance of the EBIC in finite sample situations. We also 
consider $\gamma = 0$, which corresponds to the original BIC, and $\gamma = 1$, which corresponds 
to an asymptotic form of the mBIC proposed in [2]. Throughout the simulation 
studies, $\nu(S^*)$ is taken to be $0.5n$. 

We take the divergence pattern as $(n, p_{0n}, p_n) = (n, c[n^{0.325}], [\exp(n^{0.35})])$ for n = 
100, 200, 500 and 1,000, and c = 1 and 2, which results in the table below: 



  n               100     200     500     1,000
  p_n             150     595     6,655   74,622
  p_0n (c = 1)    4       6       8       9
  p_0n (c = 2)    8       12      16      18


For $j \in s_{0n}$ the parameter $\beta_{nj}$ is independently generated as $\beta_{nj} = (-1)^{u}(n^{-0.1625} + |z|)$ 
where $u \sim \mathrm{Bernoulli}(0.4)$ and $z$ is a normal random variable with mean 0 that satisfies 
$P(|z| \ge 0.1) = 0.25$. This ensures, roughly, $\min\{|\beta_{nj}| : j \in s_{0n}\} = O(n^{-0.1625})$. 
The error variance $\sigma^2$ is determined by setting the following ratio to certain values 
when n = 100 and kept unchanged for other n's: 

$$h = \frac{E(\beta^{*T} \Sigma \beta^{*})}{E(\beta^{*T} \Sigma \beta^{*}) + \sigma^2},$$

where $\Sigma$ is the covariance matrix of the predictors and the expectation is with respect 
to the generating distribution of $\beta^{*}$. This ratio mimics what is called the heritability 
in broad sense in genetic studies. We considered h = 0.4, 0.6 and 0.8. For each 
simulation setting, 200 data sets are generated and analyzed. The following three 
correlation structures are considered for the covariates: 

Structure I: Power decay correlation. The covariates are generated as a series 
of normally distributed random variables with mean 0 and correlation coefficient 

$$\rho_{ij} = 0.5^{|i-j|}.$$

Structure II: Diagonal block design with equal pairwise correlation. The covariance 
matrix is a diagonal block matrix. Each block except the last one is of dimension 
$50 \times 50$. The variances in the blocks are all equal to 1 and the off-diagonal correlations 
are all equal to $\rho = 0.5$. 

Structure III: Diagonal block design with uniformly distributed eigenvalues. Unlike 
the diagonal block matrix in Structure II, each block is first generated such that its 
smallest eigenvalue is 1, largest eigenvalue is 50 and other eigenvalues are uniformly 
distributed over (1, 50), and then it is converted into a correlation matrix. 
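
Structures I and II are simple enough to build explicitly. The following sketch constructs the two correlation matrices and draws covariates from the first; Structure III is omitted because its construction is not fully specified here. The function names, the dimension 150 and the random seed are illustrative assumptions.

```python
import numpy as np

def power_decay_corr(p, rho=0.5):
    """Structure I: correlation rho^{|i - j|} between covariates i and j."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def block_equicorr(p, block=50, rho=0.5):
    """Structure II: diagonal blocks of size `block` (last one possibly smaller),
    unit variances and equal off-diagonal correlation rho within each block."""
    sigma = np.zeros((p, p))
    for start in range(0, p, block):
        end = min(start + block, p)
        sigma[start:end, start:end] = rho
    np.fill_diagonal(sigma, 1.0)
    return sigma

rng = np.random.default_rng(0)
sigma1 = power_decay_corr(150)
sigma2 = block_equicorr(150)

# n = 100 rows of mean-zero normal covariates with Structure I correlation.
X = rng.multivariate_normal(np.zeros(150), sigma1, size=100)
```

For the larger settings of $p_n$ it would be more economical to simulate block by block (Structure II) or through an autoregressive recursion (Structure I) than to form the full matrix.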

The finite sample performance of the EBIC is assessed by the positive discovery 
rate (PDR) and false discovery rate (FDR) defined as follows: 

$$\mathrm{PDR}_n = \frac{\nu(s_{n\lambda^*} \cap s_{0n})}{\nu(s_{0n})}, \qquad \mathrm{FDR}_n = \frac{\nu(s_{n\lambda^*} \setminus s_{0n})}{\nu(s_{n\lambda^*})},$$

where $s_{n\lambda^*}$ is the set of features selected in the selection stage of the two-stage 
procedure. The asymptotic property of selection consistency is equivalent to 

$$\lim_{n \to \infty} \mathrm{PDR}_n = 1 \quad \text{and} \quad \lim_{n \to \infty} \mathrm{FDR}_n = 0,$$

in probability. 
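
The two rates are straightforward to compute from index sets. The sketch below is illustrative; the function name `pdr_fdr` is hypothetical, and the convention of returning an FDR of 0 when nothing is selected is an assumption, since the ratio is undefined in that case.

```python
def pdr_fdr(selected, relevant):
    """Positive discovery rate and false discovery rate of a selected feature set."""
    selected, relevant = set(selected), set(relevant)
    if not relevant:
        raise ValueError("the set of relevant features must be non-empty")
    pdr = len(selected & relevant) / len(relevant)
    # Assumed convention: FDR = 0 when no feature is selected.
    fdr = len(selected - relevant) / len(selected) if selected else 0.0
    return pdr, fdr

# Three of four relevant features found, plus one irrelevant feature.
pdr, fdr = pdr_fdr(selected=[1, 2, 3, 7], relevant=[1, 2, 3, 4])
```

Here `pdr` is 3/4 and `fdr` is 1/4; exact recovery of the relevant set corresponds to the pair (1, 0).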

The simulated $\mathrm{PDR}_n$ and $\mathrm{FDR}_n$ averaged over 200 replicates for each setting are 
reported in Tables 1, 2 and 3, respectively, for correlation structures I, II and III. In the 
tables, $\gamma_{\mathrm{BIC}} = 0$ corresponds to BIC, $\gamma_{\mathrm{SC}} = 1 - \ln n/(4 \ln p_n)$ corresponds to a 
value in the selection consistent range of $\gamma$, and $\gamma_{\mathrm{mBIC}} = 1$ corresponds to mBIC. 

The following points are manifest in Tables 1, 2 and 3. (i) The finite sample 
performance of the EBIC closely matches its asymptotic property. That is, under all 
three correlation structures, for the procedure with $\mathrm{EBIC}_{\gamma_{\mathrm{SC}}}$, the $\mathrm{PDR}_n$ and the 
$\mathrm{FDR}_n$ rapidly approach 1 and 0 respectively, as n increases from 100 to 1000, at all 
three h levels. (ii) The BIC does not appear to be selection consistent. Under all 
the settings, the $\mathrm{FDR}_n$ of the procedure with BIC does not decrease as n increases; it 
is in fact quite the opposite. (iii) In general, the $\mathrm{PDR}_n$ of the procedure with BIC is 
higher because it always selects many more features. But, as n gets large, the $\mathrm{PDR}_n$ 
of $\mathrm{EBIC}_{\gamma_{\mathrm{SC}}}$ quickly becomes comparable with that of the BIC. (iv) For large n, the 
mBIC is comparable with $\mathrm{EBIC}_{\gamma_{\mathrm{SC}}}$, which reflects the fact that it is also selection 
consistent since $\gamma_{\mathrm{mBIC}} = 1$ is in the consistency range of EBIC. But for small n, it 
loses some power while overly controlling $\mathrm{FDR}_n$. 

A Appendix 

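A.1 Proof of Lemma 1 

Proof. Write 

$$\frac{p!}{j!(p-j)!} = \frac{p(p-1)\cdots(p-j+1)}{j!} = \frac{p^{j}\left(1-\frac{1}{p}\right)\cdots\left(1-\frac{j-1}{p}\right)}{j!}.$$

Note that 

$$\left(1-\frac{j-1}{p}\right)^{j-1} \le \left(1-\frac{1}{p}\right)\cdots\left(1-\frac{j-1}{p}\right) \le \left(1-\frac{1}{p}\right)^{j-1}$$

and, see [13], that 

$$\sqrt{2\pi}\, j^{j+1/2} e^{-j+1/(12j+1)} < j! < \sqrt{2\pi}\, j^{j+1/2} e^{-j+1/(12j)}.$$

We now have 

$$\ln\left(\frac{p!}{j!(p-j)!}\right) \le j \ln p + (j-1)\ln(1-1/p) - (j+1/2)\ln j + j - \frac{1}{12j+1} - \ln\sqrt{2\pi}$$
$$\le j \ln p - (j+1/2)\ln j + j = j \ln p\left[1 - \frac{(j+1/2)\ln j}{j \ln p} + \frac{1}{\ln p}\right] = j \ln p\,(1-\delta)(1+o(1)), \qquad (4)$$

and 

$$\ln\left(\frac{p!}{j!(p-j)!}\right) \ge j \ln p + (j-1)\ln\left(1-\frac{j-1}{p}\right) - (j+1/2)\ln j + j - \frac{1}{12j} - \ln\sqrt{2\pi}$$
$$\ge j \ln p + (j-1)\ln\left(1-\frac{j-1}{p}\right) - (j+1/2)\ln j - \ln\sqrt{2\pi}$$
$$= j \ln p\left[1 + \frac{(j-1)\ln\left(1-\frac{j-1}{p}\right)}{j \ln p} - \frac{(j+1/2)\ln j}{j \ln p} - \frac{\ln\sqrt{2\pi}}{j \ln p}\right] = j \ln p\,(1-\delta)(1+o(1)). \qquad (5)$$

Lemma 1 follows from (4) and (5). □ 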

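A.2 Proof of Lemma 2 

Proof. Denote $F_k(m) = P(\chi_k^2 \ge m)$. By integration by parts, we obtain 

$$F_k(m) = \frac{1}{\Gamma(k/2)}(m/2)^{k/2-1}e^{-m/2} + F_{k-2}(m).$$

If $k$ is even, 

$$F_k(m) = \sum_{i=1}^{k/2} \frac{1}{\Gamma(i)}(m/2)^{i-1}e^{-m/2}.$$

If $k$ is odd, 

$$F_k(m) = \sum_{i=1}^{(k-1)/2} \frac{1}{\Gamma(i+1/2)}(m/2)^{i-1/2}e^{-m/2} + F_1(m),$$

where 

$$F_1(m) = P(\chi_1^2 \ge m) = \frac{1}{\Gamma(1/2)}(m/2)^{-1/2}e^{-m/2}(1+o(1))$$

when $m \to +\infty$. We can write 

$$F_k(m) = \frac{1}{\Gamma(k/2)}(m/2)^{k/2-1}e^{-m/2}(1 + R(k, m)).$$

It is straightforward to see that $R(k, m) \le R(K, m) \to 0$ when $m \to +\infty$. □ 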

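A.3 Proof of Theorem 1 

Proof. Let $s$ be any submodel. Decompose $\mathrm{EBIC}_\gamma(s) - \mathrm{EBIC}_\gamma(s_{0n})$ as follows: 

$$\mathrm{EBIC}_\gamma(s) - \mathrm{EBIC}_\gamma(s_{0n}) = n \ln\left(\frac{y_n^T[I_n - H_n(s)]y_n}{y_n^T[I_n - H_n(s_{0n})]y_n}\right) + (\nu(s) - p_{0n})\ln n + 2\gamma\left(\ln\tau(S_{\nu(s)}) - \ln\tau(S_{p_{0n}})\right) \qquad (6)$$
$$= T_1 + T_2, \text{ say},$$

where 

$$T_1 = n \ln\left(\frac{y_n^T[I_n - H_n(s)]y_n}{y_n^T[I_n - H_n(s_{0n})]y_n}\right) = n \ln\left(1 + \frac{y_n^T[I_n - H_n(s)]y_n - \epsilon_n^T[I_n - H_n(s_{0n})]\epsilon_n}{\epsilon_n^T[I_n - H_n(s_{0n})]\epsilon_n}\right), \qquad (7)$$

$$T_2 = (\nu(s) - p_{0n})\ln n + 2\gamma\left(\ln\tau(S_{\nu(s)}) - \ln\tau(S_{p_{0n}})\right).$$

Case I: $s_{0n} \not\subset s$. 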

Without loss of generality, assume $\sigma^2 = 1$. We can write 

$$\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n = \sum_{i=1}^{n-p_{0n}} Z_i^2 = (n - p_{0n})(1 + o_p(1)) = n(1 + o_p(1)), \qquad (8)$$

where the $Z_i$'s are i.i.d. standard normal variables, since $H_n(s_{0n})$ is a projection matrix 
with rank $p_{0n}$. We have 

$$y_n^T[I_n - H_n(s)]y_n - \epsilon_n^T[I_n - H_n(s_{0n})]\epsilon_n = \Delta_n(s) + 2\mu_n^T[I_n - H_n(s)]\epsilon_n + \epsilon_n^T H_n(s_{0n})\epsilon_n - \epsilon_n^T H_n(s)\epsilon_n.$$

It is trivial that 

$$\epsilon_n^T H_n(s_{0n})\epsilon_n = p_{0n}(1 + o_p(1)). \qquad \text{(I)}$$

We will show 

$$\max\{\epsilon_n^T H_n(s)\epsilon_n : \nu(s) \le k_n\} = O_p(k_n \ln p_n), \qquad \text{(II)}$$

and 

$$|\mu_n^T[I_n - H_n(s)]\epsilon_n| = \sqrt{\Delta_n(s)\,O_p(k_n \ln p_n)}, \qquad \text{(III)}$$

uniformly for all $s$ with $\nu(s) \le k_n$. Under the assumption of the theorem, $k_n \ln p_n = 
o(n)$. Then, by the consistency condition, (I), (II) and (III) imply that 

$$y_n^T[I_n - H_n(s)]y_n - \epsilon_n^T[I_n - H_n(s_{0n})]\epsilon_n = \Delta_n(s)(1 + o_p(1)), \qquad (9)$$

uniformly for all $s$ with $\nu(s) \le k_n$. It then follows from (8) and (9) that 

$$T_1 = n \ln\left(1 + \frac{\Delta_n(s)}{n}(1 + o_p(1))\right), \qquad (10)$$

uniformly for all $s$ with $\nu(s) \le k_n$. 

We now prove (II) and (III) in the following. Let $m = 2k_n[\ln p_n + \ln(k_n \ln p_n)]$. It 
is obvious that $k_n/m \to 0$. Note that we can express $\epsilon_n^T H_n(s)\epsilon_n = \chi_j^2(s)$ where $j = \nu(s)$. 
By the Bonferroni inequality, we have 

$$P(\max\{\epsilon_n^T H_n(s)\epsilon_n : \nu(s) \le k_n\} \ge m) = P(\max\{\chi_j^2(s) : s \in S_j,\ j \le k_n\} \ge m) \le \sum_{j=1}^{k_n} \tau(S_j) P(\chi_j^2 \ge m).$$

By the fact that $\tau(S_j) = \binom{p_n}{j} \le p_n^{j}$ and Lemma 2, there is some $c$ close to 1, not 
depending on $j$ for $j \le k_n$, such that 

$$\tau(S_j)P(\chi_j^2 \ge m) \le c\, p_n^{j}\,\frac{1}{\Gamma(j/2)}(m/2)^{j/2-1}e^{-m/2} \le \frac{c}{m}(k_n \ln p_n)^{-j} m^{j/2} = \frac{c}{m} q_n^{j}, \text{ say},$$

where 

$$q_n = \sqrt{\frac{m}{(k_n \ln p_n)^2}} = \sqrt{\frac{2[k_n \ln p_n + k_n \ln(k_n \ln p_n)]}{(k_n \ln p_n)^2}} < q,$$

for some $q$ between 0 and 1, when n is large enough, since $q_n \to 0$. Thus 

$$P(\max\{\epsilon_n^T H_n(s)\epsilon_n : \nu(s) \le k_n\} \ge m) \le \frac{c}{m}\sum_{j=1}^{k_n} q^{j} \le \frac{c}{m}\,\frac{q}{1-q} \to 0; \qquad (11)$$

that is, 

$$\max\{\epsilon_n^T H_n(s)\epsilon_n : \nu(s) \le k_n\} = m(1 + o_p(1)) = O_p(k_n \ln p_n),$$

which establishes (II). 

For verifying (III), note that we can express 

$$\mu_n^T[I_n - H_n(s)]\epsilon_n = \sqrt{\Delta_n(s)}\,Z(s),$$

where $Z(s) \sim N(0, 1)$. For any $s$ with $\nu(s) \le k_n$, we have 

$$|\mu_n^T\{I_n - H_n(s)\}\epsilon_n| \le \sqrt{\Delta_n(s)}\,\max\{|Z(s)| : \nu(s) \le k_n\}.$$

Let $m$ be the same as above. Consider $P(\max\{|Z(s)| : \nu(s) \le k_n\} \ge \sqrt{m})$. We have 

$$P(\max\{|Z(s)| : \nu(s) \le k_n\} \ge \sqrt{m}) = P(\max\{|Z(s)| : s \in S_j,\ j \le k_n\} \ge \sqrt{m})$$
$$\le \sum_{j=1}^{k_n} \tau(S_j)P(|Z(s)| \ge \sqrt{m}) = \sum_{j=1}^{k_n} \tau(S_j)P(\chi_1^2 \ge m) \le \sum_{j=1}^{k_n} \tau(S_j)P(\chi_j^2 \ge m),$$

since $P(\chi_1^2 \ge m) \le P(\chi_j^2 \ge m)$ by Lemma 2. We have already shown that the last 
sum converges to zero. This establishes (III). 

Now, putting (6), (7) and (10) together, we have 

$$\mathrm{EBIC}_\gamma(s) - \mathrm{EBIC}_\gamma(s_{0n}) = n \ln\left(1 + \frac{\Delta_n(s)}{n}(1 + o_p(1))\right) + (\nu(s) - p_{0n})\ln n + 2\gamma\left(\ln\tau(S_{\nu(s)}) - \ln\tau(S_{p_{0n}})\right)$$
$$\ge n \ln\left[1 + C p_{0n}\ln p_n / n\,(1 + o_p(1))\right] - p_{0n}(\ln n + 2\gamma \ln p_n),$$

for some positive $C$, when n is large enough, by the consistency condition. Then by 
choosing $C > 1 + 2\gamma$, the difference goes to infinity as $n \to \infty$. 

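Case II: $s_{0n} \subset s$. 

When $s_{0n} \subset s$, $\{I_n - H_n(s)\}X_n(s_{0n}) = 0$. Hence, $y_n^T\{I_n - H_n(s)\}y_n = \epsilon_n^T\{I_n - 
H_n(s)\}\epsilon_n$ and 

$$\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n - \epsilon_n^T\{I_n - H_n(s)\}\epsilon_n = \epsilon_n^T\{H_n(s) - H_n(s_{0n})\}\epsilon_n = \chi_j^2(s),$$

where $\chi_j^2(s)$ is a $\chi^2$ random variable depending on $s$ with degrees of freedom $j$ and 
$j = \nu(s) - p_{0n}$. We obtain that 

$$n \log\left(\frac{\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n}{\epsilon_n^T\{I_n - H_n(s)\}\epsilon_n}\right) = n \log\left\{1 + \frac{\chi_j^2(s)}{\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n - \chi_j^2(s)}\right\}. \qquad (12)$$

As $n \to \infty$, $n^{-1}\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n \to \sigma^2 = 1$, i.e., 

$$\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n = n(1 + o_p(1)). \qquad (13)$$

Let $\tilde{S}_j = \{s : s \in S_{j+p_{0n}},\ s_{0n} \subset s\}$. Note that $\tau(\tilde{S}_j) = \binom{p_n - p_{0n}}{j} \le p_n^{j}$. Let $m_j = 
2j[\ln p_n + \ln(j \ln p_n)]$. In the same way as we derive (11), we have 

$$P\left(\max_{1 \le j \le k_n - p_{0n}} \frac{\max\{\chi_j^2(s) : s \in \tilde{S}_j\}}{m_j} \ge 1\right) \le \sum_{j=1}^{k_n - p_{0n}} P(\max\{\chi_j^2(s) : s \in \tilde{S}_j\} \ge m_j) \le \sum_{j=1}^{k_n - p_{0n}} \frac{c}{m_j} q_j^{j} \to 0,$$

where 

$$q_j = \sqrt{\frac{2}{j \ln p_n} + \frac{2\ln(j \ln p_n)}{j(\ln p_n)^2}} \to 0 \text{ uniformly in } j.$$

Thus, 

$$\max\{\chi_j^2(s) : s \in S_{j+p_{0n}},\ s_{0n} \subset s\} = m_j\{1 + o_p(1)\}, \qquad (14)$$

uniformly for all $s$ with $\nu(s) \le k_n$ and $s_{0n} \subset s$. 
It follows from (12), (13) and (14) that 

$$n \log\left(\frac{\epsilon_n^T\{I_n - H_n(s_{0n})\}\epsilon_n}{\epsilon_n^T\{I_n - H_n(s)\}\epsilon_n}\right) \le \frac{n\,m_j(1 + o_p(1))}{n - m_j(1 + o_p(1))} \le m_j(1 + o_p(1)) \le 2j(1+\delta)\ln p_n(1 + o_p(1)),$$

uniformly for all $s$ with $\nu(s) \le k_n$ and $s_{0n} \subset s$, noting that $m_j \le 2j[\ln p_n + \ln((k_n - 
p_{0n})\ln p_n)] = 2j(1+\delta)\ln p_n(1 + o(1))$ and $m_j = 2j(1+\delta)\ln p_n(1 + o(1))$ for $j = 
k_n - p_{0n}$. Thus 

$$T_1 \ge -2j(1+\delta)\ln p_n(1 + o_p(1)).$$

When $p_{0n} < \nu(s) \le k_n$ we have $\ln\nu(s)/\ln p_n \to \delta$ uniformly, hence, by Lemma 1, 

$$T_2 = j \ln n + 2\gamma(1-\delta)j \ln p_n(1 + o(1)).$$

Finally we have 

$$\mathrm{EBIC}_\gamma(s) - \mathrm{EBIC}_\gamma(s_{0n}) \ge j \ln n + 2\gamma(1-\delta)j \ln p_n(1 + o(1)) - 2j(1+\delta)\ln p_n(1 + o_p(1)) > 0,$$

uniformly for all $s$ with $\nu(s) \le k_n$ and $s_{0n} \subset s$, if n is big enough, when 
$\gamma > \frac{1+\delta}{1-\delta} - \frac{\ln n}{2(1-\delta)\ln p_n}$. □ 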
Table 1: The PDR and FDR of the SIS-SCAD-EBIC procedure with Structure I 
(power decay correlation) averaged over 200 replicates (the numbers in 
parentheses are standard deviations) 

c = 1 
                        PDR                                   FDR
   n    h    γ_BIC       γ_SC        γ_mBIC       γ_BIC       γ_SC        γ_mBIC
  100  .4   .726(.242)  .450(.291)  .384(.288)   .571(.212)  .074(.205)  .050(.181)
       .6   .861(.187)  .700(.271)  .633(.301)   .478(.216)  .080(.170)  .044(.123)
       .8   .973(.090)  .921(.159)  .909(.176)   .363(.204)  .085(.147)  .056(.120)
  200  .4   .759(.205)  .532(.270)  .467(.270)   .662(.177)  .034(.101)  .017(.071)
       .6   .910(.144)  .758(.256)  .711(.282)   .574(.185)  .080(.145)  .038(.100)
       .8   .989(.056)  .957(.105)  .947(.128)   .389(.200)  .060(.115)  .045(.105)
  500  .4   .826(.146)  .640(.212)  .604(.214)   .768(.100)  .037(.090)  .011(.046)
       .6   .943(.100)  .863(.164)  .836(.181)   .660(.133)  .066(.128)  .028(.079)
       .8   .994(.035)  .983(.060)  .980(.067)   .504(.190)  .027(.073)  .019(.065)
 1000  .4   1.000(.00)  .999(.008)  .999(.011)   .662(.024)  .019(.041)  .009(.028)
       .6   1.000(.00)  1.000(.00)  1.000(.00)   .531(.037)  .019(.041)  .008(.026)
       .8   1.000(.00)  1.000(.00)  1.000(.00)   .470(.010)  .007(.025)  .002(.014)

c = 2 
                        PDR                                   FDR
   n    h    γ_BIC       γ_SC        γ_mBIC       γ_BIC       γ_SC        γ_mBIC
  100  .4   .531(.183)  .243(.169)  .198(.162)   .507(.222)  .069(.204)  .041(.172)
       .6   .680(.166)  .416(.213)  .350(.206)   .447(.187)  .074(.173)  .026(.093)
       .8   .850(.153)  .708(.225)  .628(.248)   .373(.163)  .118(.143)  .068(.118)
  200  .4   .613(.162)  .306(.164)  .260(.161)   .619(.162)  .028(.096)  .010(.066)
       .6   .720(.148)  .518(.211)  .456(.207)   .545(.181)  .036(.082)  .018(.061)
       .8   .895(.125)  .745(.199)  .703(.217)   .447(.164)  .086(.117)  .053(.096)
  500  .4   .732(.130)  .425(.174)  .371(.166)   .774(.076)  .014(.054)  .004(.025)
       .6   .832(.104)  .635(.176)  .590(.186)   .695(.112)  .028(.064)  .010(.031)
       .8   .956(.067)  .875(.135)  .847(.157)   .535(.159)  .098(.121)  .068(.104)
 1000  .4   .758(.108)  .537(.161)  .491(.164)   .825(.055)  .012(.040)  .005(.025)
       .6   .849(.102)  .715(.134)  .689(.144)   .761(.077)  .025(.062)  .010(.035)
       .8   .969(.054)  .925(.084)  .906(.106)   .581(.146)  .095(.110)  .072(.095)
Table 2: The PDR and FDR of the SIS-SCAD-EBIC procedure with Structure II (block with equal pairwise correlation) averaged over 200 replicates (the numbers in parentheses are standard deviations)

c = 1
          PDR                                   FDR
   n  h   γ_BIC       γ_SC        γ_mBIC        γ_BIC       γ_SC        γ_mBIC
 100  4   .733(.285)  .402(.318)  .343(.291)    .427(.268)  .229(.369)  .198(.362)
      6   .933(.154)  .772(.297)  .703(.321)    .340(.213)  .117(.197)  .094(.207)
      8   .996(.042)  .967(.118)  .960(.125)    .293(.203)  .053(.132)  .036(.114)
 200  4   .868(.203)  .534(.303)  .479(.306)    .442(.206)  .133(.249)  .110(.246)
      6   .994(.039)  .931(.168)  .889(.214)    .321(.173)  .107(.161)  .078(.143)
      8   1.000(.00)  .996(.031)  .994(.040)    .292(.165)  .025(.081)  .017(.070)
 500  4   .948(.093)  .754(.178)  .723(.184)    .689(.114)  .056(.107)  .049(.103)
      6   .993(.035)  .922(.121)  .904(.132)    .626(.127)  .031(.080)  .019(.064)
      8   1.000(.00)  .997(.024)  .992(.044)    .585(.151)  .060(.110)  .031(.083)
1000  4   .940(.080)  .813(.158)  .785(.180)    .818(.046)  .073(.113)  .049(.092)
      6   .995(.025)  .988(.041)  .986(.043)    .739(.066)  .039(.084)  .035(.079)
      8   .999(.010)  .998(.017)  .996(.024)    .653(.107)  .024(.070)  .017(.061)

c = 2
          PDR                                   FDR
   n  h   γ_BIC       γ_SC        γ_mBIC        γ_BIC       γ_SC        γ_mBIC
 100  4   .430(.239)  .193(.174)  .173(.164)    .449(.294)  .310(.411)  .295(.408)
      6   .684(.234)  .390(.236)  .343(.224)    .343(.220)  .164(.235)  .150(.253)
      8   .881(.179)  .676(.266)  .603(.284)    .308(.194)  .105(.174)  .096(.175)
 200  4   .489(.206)  .199(.142)  .165(.133)    .416(.235)  .134(.275)  .115(.259)
      6   .727(.192)  .421(.227)  .356(.214)    .351(.195)  .065(.144)  .055(.132)
      8   .919(.135)  .718(.254)  .672(.269)    .351(.184)  .055(.099)  .043(.088)
 500  4   .664(.137)  .258(.132)  .238(.132)    .669(.145)  .031(.099)  .020(.076)
      6   .834(.127)  .468(.211)  .407(.209)    .609(.132)  .029(.073)  .014(.047)
      8   .944(.094)  .804(.244)  .778(.266)    .485(.198)  .084(.108)  .068(.095)
1000  4   .675(.133)  .311(.158)  .284(.158)    .830(.079)  .017(.055)  .014(.050)
      6   .882(.134)  .551(.234)  .496(.240)    .744(.115)  .060(.108)  .033(.073)
      8   .960(.078)  .884(.195)  .877(.202)    .616(.178)  .069(.099)  .061(.087)

Table 3: The PDR and FDR of the SIS-SCAD-EBIC procedure with Structure III (block with uniformly distributed eigenvalues) averaged over 200 replicates (the numbers in parentheses are standard deviations)

c = 1
          PDR                                   FDR
   n  h   γ_BIC       γ_SC        γ_mBIC        γ_BIC       γ_SC        γ_mBIC
 100  4   .915(.146)  .667(.302)  .564(.327)    .428(.191)  .041(.102)  .020(.078)
      6   .996(.031)  .964(.116)  .950(.133)    .360(.181)  .046(.105)  .019(.063)
      8   1.000(.00)  1.000(.00)  1.000(.00)    .326(.165)  .038(.096)  .011(.051)
 200  4   .993(.037)  .865(.206)  .811(.252)    .575(.162)  .050(.101)  .024(.073)
      6   1.000(.00)  .999(.014)  .999(.014)    .536(.129)  .032(.081)  .013(.048)
      8   1.000(.00)  1.000(.00)  1.000(.00)    .457(.138)  .023(.065)  .009(.042)
 500  4   1.000(.00)  .971(.081)  .961(.090)    .768(.042)  .041(.075)  .023(.055)
      6   1.000(.00)  1.000(.00)  1.000(.00)    .704(.058)  .022(.060)  .010(.043)
      8   1.000(.00)  1.000(.00)  1.000(.00)    .608(.091)  .016(.050)  .007(.038)
1000  4   1.000(.00)  .999(.011)  .997(.017)    .790(.040)  .023(.046)  .008(.028)
      6   1.000(.00)  1.000(.00)  1.000(.00)    .740(.038)  .018(.041)  .005(.021)
      8   1.000(.00)  1.000(.00)  1.000(.00)    .705(.051)  .005(.022)  .002(.012)

c = 2
          PDR                                   FDR
   n  h   γ_BIC       γ_SC        γ_mBIC        γ_BIC       γ_SC        γ_mBIC
 100  4   .643(.218)  .240(.201)  .155(.179)    .409(.206)  .071(.185)  .028(.128)
      6   .911(.141)  .589(.298)  .461(.302)    .346(.168)  .092(.163)  .045(.129)
      8   .995(.033)  .975(.100)  .964(.135)    .237(.136)  .089(.101)  .069(.092)
 200  4   .801(.147)  .307(.210)  .209(.179)    .536(.136)  .050(.142)  .013(.061)
      6   .974(.063)  .817(.198)  .742(.236)    .443(.147)  .076(.095)  .045(.073)
      8   .999(.010)  .993(.041)  .989(.048)    .322(.121)  .046(.074)  .034(.063)
 500  4   .933(.076)  .578(.204)  .451(.215)    .723(.079)  .035(.067)  .009(.036)
      6   .992(.030)  .946(.073)  .930(.094)    .642(.091)  .062(.078)  .045(.069)
      8   .999(.005)  .998(.016)  .997(.017)    .498(.105)  .023(.044)  .014(.036)
1000  4   .970(.049)  .780(.170)  .688(.207)    .809(.051)  .042(.063)  .018(.039)
      6   .997(.013)  .976(.054)  .973(.058)    .738(.059)  .030(.053)  .024(.042)
      8   .999(.004)  .998(.011)  .998(.012)    .608(.085)  .011(.031)  .006(.022)
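To make the table entries concrete, the following minimal sketch shows how a PDR and an FDR could be computed for one replicate and then summarized as mean(standard deviation) over replicates. It assumes the commonly used definitions, since this section does not restate them: PDR is the fraction of relevant features that are selected, and FDR is the fraction of selected features that are irrelevant. The function names, the toy feature indices, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from statistics import mean, stdev


def pdr_fdr(selected, relevant):
    """Return (PDR, FDR) for one replicate.

    Assumed definitions (not restated in this section):
      PDR = |selected ∩ relevant| / |relevant|
      FDR = |selected \\ relevant| / |selected|
    An empty selection is given FDR = 0 by convention here.
    """
    selected, relevant = set(selected), set(relevant)
    true_pos = len(selected & relevant)
    pdr = true_pos / len(relevant) if relevant else 0.0
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return pdr, fdr


def summarize(rates):
    """Return (mean, sample standard deviation) of per-replicate rates."""
    sd = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), sd


if __name__ == "__main__":
    # Toy replicates (hypothetical feature indices, not data from the paper).
    relevant = {1, 2, 3, 4}
    selections = [{1, 2, 5}, {1, 2, 3, 4}, {1, 3, 4, 7, 9}]
    pdrs, fdrs = zip(*(pdr_fdr(s, relevant) for s in selections))
    for name, rates in (("PDR", pdrs), ("FDR", fdrs)):
        m, sd = summarize(list(rates))
        print(f"{name}: {m:.3f}({sd:.3f})")
```

Under these definitions a selection procedure trades the two rates against each other: a smaller penalty tends to raise both PDR and FDR, which is the pattern the γ_BIC columns show relative to the γ_SC and γ_mBIC columns.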